Feature Selection and LASSO

MSSC 6250 Statistical Machine Learning

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Feature/Variable Selection

When OLS Doesn’t Work Well

When \(p \gg n\), it is often the case that many of the features in the model are not associated with the response.

  • Model Interpretability: By removing irrelevant features \(X_j\)s, i.e., setting the corresponding \(\beta_j\)s to zero, we can obtain a model that is more easily interpreted. (Feature/Variable selection)

  • Least squares is unlikely to yield any coefficient estimates that are exactly zero.

Three Classes of Methods

  • Subset Selection. Identify a subset of the \(p\) predictors believed to be related to the response, then fit least squares on the reduced set of variables.
  • Shrinkage. Fit a model that forces some coefficients to be shrunk to zero.
  • Dimension Reduction. Find \(m\) representative features that are linear combinations of the \(p\) original predictors (\(m \ll p\)), then fit least squares (see the sketch below).
    • Principal component regression (Unsupervised) pls::pcr()
    • Partial least squares (Supervised) pls::plsr()
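A minimal sketch of the two dimension-reduction fits (assuming the pls package is installed; dat is a hypothetical data frame whose response column is y):

library(pls)
# dat is a hypothetical data frame with response y and numeric predictors
pcr_fit <- pcr(y ~ ., data = dat, scale = TRUE, validation = "CV")    # unsupervised components
pls_fit <- plsr(y ~ ., data = dat, scale = TRUE, validation = "CV")   # supervised components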

Subset Selection Method

Variable Selection

  • We have a large pool of candidate regressors, of which only a few are likely to be important.

Two “conflicting” goals in model building:

  • as many regressors as possible for better predictive performance on new data (smaller bias).

  • as few regressors as possible because, as the number of regressors increases,

    • \(\mathrm{Var}(\hat{y})\) increases (larger variance)
    • data collection and maintenance cost more
    • the model becomes more complex

A compromise between the two hopefully leads to the “best” regression equation.

What does best mean?

There is no unique definition of “best”, and different methods specify different subsets of the candidate regressors as best.

Model Selection Criteria

  • The full (largest) model has \(M\) candidate regressors.

  • There are \(M \choose d\) possible subset models containing \(d\) of the candidate regressors.

  • In total, there are \(2^M\) possible subset models.

  • An evaluation metric should consider Goodness of Fit and Model Complexity:

Goodness of Fit: the more regressors, the better the fit to the training data

Complexity Penalty: the fewer regressors, the better

  • Evaluate subset models:
    • \(R_{adj}^2\) \(\uparrow\)
    • Mallow’s \(C_p\) \(\downarrow\)
    • Information Criterion (AIC, BIC) \(\downarrow\)
    • PREdiction Sum of Squares (PRESS) \(\downarrow\) (Allen, D.M. (1974))

Selection Criteria: Mallow’s \(C_p\) Statistic \(\downarrow\)

For a model with \(d\) predictors,

\[\begin{align} C_p &= \frac{SS_{res}(d)}{\hat{\sigma}^2} - n + 2d \\ &= d + \frac{(s^2 - \hat{\sigma}^2)(n-d)}{\hat{\sigma}^2} \end{align}\]

  • \(\hat{\sigma}^2\) is the variance estimate from the full model, i.e., \(\hat{\sigma}^2 = MS_{res}(M)\).

  • \(s^2\) is the variance estimate from the model with \(d\) coefficients, i.e., \(s^2 = MS_{res}(d)\).

  • Favors the candidate model with the smallest \(C_p\).

  • For unbiased models, i.e., those with \(E[\hat{y}_i] = E[y_i]\), \(E[C_p] \approx d\).

    • All of the error in \(\hat{y}_i\) is variance, and the model is not underfitted. (A small computational sketch of \(C_p\) follows.)
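A minimal sketch of this computation, where fit_full and fit_sub are hypothetical lm() fits of the full model and a candidate submodel:

sigma2_hat <- summary(fit_full)$sigma^2      # MS_res(M): variance estimate from the full model
d  <- length(coef(fit_sub))                  # number of coefficients in the submodel
n  <- nobs(fit_sub)
Cp <- sum(resid(fit_sub)^2) / sigma2_hat - n + 2 * d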

Mallow’s \(C_p\) Plot

  • Model A is a heavily biased model.
  • Model D is the poorest performer.
  • Models B and C are reasonable.
  • Model C has \(C_p < 3\), which implies \(MS_{res}(3) < MS_{res}(M)\).

Selection Criteria: Information Criterion \(\downarrow\)

For a model with \(d\) predictors,

  • Akaike information criterion (AIC) is \[\text{AIC} = n \ln \left( \frac{SS_{res}(d)}{n} \right) + 2d\]

  • Bayesian information criterion (BIC) is \[\text{BIC} = n \ln \left( \frac{SS_{res}(d)}{n} \right) + d \ln (n)\]

  • BIC penalizes each additional variable more heavily as the sample size increases, since its per-variable penalty is \(\ln(n)\) rather than 2.

  • BIC tends to choose models with fewer features (see the sketch below).
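A minimal sketch of these two formulas for a hypothetical lm() fit fit_sub with \(d\) coefficients (R's built-in AIC() and BIC() use the full Gaussian log-likelihood, so their values differ from these by a constant shift for a fixed data set, which does not affect model comparisons):

n     <- nobs(fit_sub)
d     <- length(coef(fit_sub))
SSres <- sum(resid(fit_sub)^2)
aic   <- n * log(SSres / n) + 2 * d
bic   <- n * log(SSres / n) + d * log(n)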

Selection Criteria: PRESS \(\downarrow\)

  • PREdiction Sum of Squares (PRESS)

  • \(\text{PRESS}_d = \sum_{i=1}^n[y_i - \hat{y}_{(i)}]^2 = \sum_{i=1}^n\left( \frac{e_i}{1-h_{ii}}\right)^2\), where \(\hat{y}_{(i)}\) is the prediction of \(y_i\) from the model fitted without observation \(i\), \(e_i = y_i - \hat{y}_i\), and \(h_{ii}\) is the \(i\)th leverage.

  • \(\text{PRESS}/n\) is in fact the LOOCV estimate of the test MSE!

  • \(R_{pred, d}^2 = 1 - \frac{PRESS_d}{SS_T}\) (see the sketch below)
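A minimal sketch of PRESS and \(R_{pred}^2\) for a hypothetical lm() fit fit_sub, using the leverage shortcut rather than refitting the model \(n\) times:

y_obs  <- model.response(model.frame(fit_sub))
press  <- sum((resid(fit_sub) / (1 - hatvalues(fit_sub)))^2)
SST    <- sum((y_obs - mean(y_obs))^2)
R2pred <- 1 - press / SST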

Selection Methods: All Possible Selection

  • Assume the intercept is in all models.

  • If there are \(M\) possible regressors, we investigate all \(2^M - 1\) possible regression equations.

  • Use the selection criteria to determine some candidate models and complete regression analysis on them.

manpower <- read.csv(file = "./data/manpower.csv", header = TRUE)
lm_full <- lm(y ~ ., data = manpower)
olsrr_all <- olsrr::ols_step_all_possible(lm_full)
names(olsrr_all)
 [1] "mindex"     "n"          "predictors" "rsquare"    "adjr"      
 [6] "predrsq"    "cp"         "aic"        "sbic"       "sbc"       
[11] "msep"       "fpe"        "apc"        "hsp"       

   Index N     Predictors R-Square Adj. R-Square Mallow's Cp
3      1 1             x3    0.972         0.970       20.38
1      2 1             x1    0.971         0.970       21.20
2      3 1             x2    0.893         0.886      114.97
4      4 1             x4    0.884         0.877      125.87
5      5 1             x5    0.335         0.290      785.26
10     6 2          x2 x3    0.987         0.985        4.94
6      7 2          x1 x2    0.986         0.984        5.66
14     8 2          x3 x5    0.985         0.983        7.29
9      9 2          x1 x5    0.984         0.982        8.16
13    10 2          x3 x4    0.975         0.972       18.57
8     11 2          x1 x4    0.974         0.970       20.04
7     12 2          x1 x3    0.973         0.969       21.99
11    13 2          x2 x4    0.931         0.921       72.29
12    14 2          x2 x5    0.924         0.913       80.30
15    15 2          x4 x5    0.910         0.898       96.50
23    16 3       x2 x3 x5    0.990         0.988        2.92
18    17 3       x1 x2 x5    0.989         0.987        3.71
16    18 3       x1 x2 x3    0.987         0.984        6.21
22    19 3       x2 x3 x4    0.987         0.984        6.90
17    20 3       x1 x2 x4    0.986         0.983        7.66
20    21 3       x1 x3 x5    0.985         0.982        8.97
25    22 3       x3 x4 x5    0.985         0.982        9.00
21    23 3       x1 x4 x5    0.985         0.981        9.41
19    24 3       x1 x3 x4    0.978         0.974       16.80
24    25 3       x2 x4 x5    0.952         0.941       48.28
30    26 4    x2 x3 x4 x5    0.991         0.988        4.03
28    27 4    x1 x2 x4 x5    0.991         0.987        4.26
27    28 4    x1 x2 x3 x5    0.991         0.987        4.35
26    29 4    x1 x2 x3 x4    0.988         0.984        7.54
29    30 4    x1 x3 x4 x5    0.985         0.980       10.92
31    31 5 x1 x2 x3 x4 x5    0.991         0.987        6.00
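As a possible follow-up (a sketch; the names() output above suggests the criteria are stored as elements of olsrr_all), one can pull out the subset with the smallest \(C_p\):

olsrr_all$predictors[which.min(olsrr_all$cp)]   # here "x2 x3 x5", with Cp = 2.92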

Other Subset Selection Methods

Lasso (Least Absolute Shrinkage and Selection Operator)

Why Lasso?

  • All possible selection may be computationally infeasible.

  • Other subset selection methods do not explore all possible subset models. (No global solution)

  • Ridge regression does shrink coefficients, but it still includes all \(p\) predictors in the model.

  • Lasso regularizes coefficients so that some of them are shrunk exactly to zero, performing feature selection.

  • Like ridge regression, for a given \(\lambda\), Lasso only fits a single model.

What is Lasso?

Unlike ridge regression, which adds an \(\ell_2\) penalty, Lasso puts an \(\ell_1\) penalty on the parameters:

\[\begin{align} \widehat{\boldsymbol \beta}^\text{l} =& \, \, \mathop{\mathrm{arg\,min}}_{\boldsymbol \beta} \lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2 + n \lambda \lVert\boldsymbol \beta\rVert_1\\ =& \, \, \mathop{\mathrm{arg\,min}}_{\boldsymbol \beta} \frac{1}{n} \sum_{i=1}^n (y_i - x_i' \boldsymbol \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|, \end{align}\]

  • The \(\ell_1\) penalty forces some of the coefficient estimates to be exactly equal to zero when \(\lambda\) is sufficiently large, yielding sparse models.

\(\ell_2\) vs. \(\ell_1\)

  • Ridge shrinks big coefficients much more than lasso does, because the \(\ell_2\) penalty grows quadratically.
  • Lasso puts a relatively larger penalty on small coefficients, which is what pushes them exactly to zero.

ElemStatLearn::prostate Data

lcavol lweight age   lbph svi   lcp gleason pgg45   lpsa train
-0.580    2.77  50 -1.386   0 -1.39       6     0 -0.431  TRUE
-0.994    3.32  58 -1.386   0 -1.39       6     0 -0.163  TRUE
-0.511    2.69  74 -1.386   0 -1.39       7    20 -0.163  TRUE
-1.204    3.28  58 -1.386   0 -1.39       6     0 -0.163  TRUE
 0.751    3.43  62 -1.386   0 -1.39       6     0  0.372  TRUE
-1.050    3.23  50 -1.386   0 -1.39       6     0  0.765  TRUE
 0.737    3.47  64  0.615   0 -1.39       6     0  0.765 FALSE
 0.693    3.54  58  1.537   0 -1.39       6     0  0.854  TRUE

cv.glmnet(alpha = 1)

library(glmnet)
lasso_fit <- cv.glmnet(x = data.matrix(prostate[, 1:8]), y = prostate$lpsa, nfolds = 10, 
                       alpha = 1)
plot(lasso_fit)
plot(lasso_fit$glmnet.fit, "lambda")

Lasso Coefficients

  • The fit at lambda.min contains more nonzero parameters than the fit at lambda.1se.

  • Larger penalty \(\lambda\) forces more parameters to be zero, and the model is more “sparse”.

coef(lasso_fit, s = "lambda.min")
9 x 1 sparse Matrix of class "dgCMatrix"
                  s1
(Intercept)  0.17999
lcavol       0.56099
lweight      0.61908
age         -0.02069
lbph         0.09531
svi          0.75180
lcp         -0.09912
gleason      0.04745
pgg45        0.00432
coef(lasso_fit, s = "lambda.1se")
9 x 1 sparse Matrix of class "dgCMatrix"
               s1
(Intercept) 1.101
lcavol      0.433
lweight     0.202
age         .    
lbph        .    
svi         0.271
lcp         .    
gleason     .    
pgg45       .    
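Predictions from the selected sparse model come from predict(); a quick sketch, reusing the training predictors as newx purely for illustration:

pred_1se <- predict(lasso_fit, newx = data.matrix(prostate[, 1:8]), s = "lambda.1se")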

One-Variable Lasso and Shrinkage: Concept

  • The lasso solution does not have an analytic or closed-form expression in general.

  • Consider the univariate regression model

\[\underset{\beta}{\mathop{\mathrm{arg\,min}}} \quad \frac{1}{n} \sum_{i=1}^n (y_i - x_i \beta)^2 + \lambda |\beta|\]

With some derivation, and using the OLS solution \(b\) of the (unpenalized) loss function, we have

\[\begin{align} &\frac{1}{n} \sum_{i=1}^n (y_i - x_i \beta)^2 \\ =& \frac{1}{n} \sum_{i=1}^n (y_i - x_i b + x_i b - x_i \beta)^2 \\ =& \frac{1}{n} \sum_{i=1}^n \Big[ \underbrace{(y_i - x_i b)^2}_{\text{I}} + \underbrace{2(y_i - x_i b)(x_i b - x_i \beta)}_{\text{II}} + \underbrace{(x_i b - x_i \beta)^2}_{\text{III}} \Big] \end{align}\]

One-Variable Lasso and Shrinkage: Concept

\[\begin{align} & \sum_{i=1}^n 2(y_i - x_i b)(x_i b - x_i \beta) = (b - \beta) {\color{OrangeRed}{\sum_{i=1}^n 2(y_i - x_i b)x_i}} = (b - \beta) {\color{OrangeRed}{0}} = 0 \end{align}\]

Our original problem reduces to just the third term and the penalty

\[\begin{align} &\underset{\beta}{\mathop{\mathrm{arg\,min}}} \quad \frac{1}{n} \sum_{i=1}^n (x_ib - x_i \beta)^2 + \lambda |\beta| \\ =&\underset{\beta}{\mathop{\mathrm{arg\,min}}} \quad \frac{1}{n} \left[ \sum_{i=1}^n x_i^2 \right] (b - \beta)^2 + \lambda |\beta| \end{align} \] Without loss of generality, assume that \(x\) is standardized with mean 0 and variance \(\frac{1}{n}\sum_{i=1}^n x_i^2 = 1\).

One-Variable Lasso and Shrinkage: Concept

This leads to a general problem of

\[\underset{\beta}{\mathop{\mathrm{arg\,min}}} \quad (\beta - b)^2 + \lambda |\beta|,\] For \(\beta > 0\),

\[\begin{align} 0 =& \frac{\partial}{\partial \beta} \,\, \left[(\beta - b)^2 + \lambda |\beta| \right] = 2 (\beta - b) + \lambda \\ \Longrightarrow \quad \beta =&\, b - \lambda/2 \end{align}\]

\[\begin{align} \widehat{\beta}^\text{l} &= \begin{cases} b - \lambda/2 & \text{if} \quad b > \lambda/2 \\ 0 & \text{if} \quad |b| \le \lambda/2 \\ b + \lambda/2 & \text{if} \quad b < -\lambda/2 \\ \end{cases} \end{align}\]

  • Lasso provides a soft-thresholding solution.

  • When \(\lambda\) is large enough, \(\widehat{\beta}^\text{l}\) will be shrunk to zero.

Objective function

The objective function is \((\beta - 1)^2\), i.e., \(b = 1\). Once the penalty \(\lambda\) is larger than 2, the minimizer stays at 0.
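A minimal R sketch of the soft-thresholding rule above (soft_threshold() is a hypothetical helper; b is the one-variable OLS estimate and lambda the penalty):

soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)
soft_threshold(b = 1, lambda = c(1, 2, 2.5))   # 0.5, 0, 0: the estimate stays at 0 once lambda >= 2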

Variable Selection Property and Shrinkage

  • The proportion of times (out of 100 replications) each variable receives a non-zero parameter estimate.

  • \(\mathbf{y}= \mathbf{X}\boldsymbol \beta + \boldsymbol \epsilon\), where \(\beta_j = 0.4^{\sqrt{j}}\), i.e., \(y_i = \sum_{j = 1}^p x_{ij} \times 0.4^{\sqrt{j}} + \epsilon_i\)

  • \(p = 20\), \(\epsilon_i \sim N(0, 1)\), replicated 100 times (a one-replicate sketch follows).
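A one-replicate sketch of this simulation (n = 100 is an assumption; the slide only specifies \(p = 20\), \(\epsilon \sim N(0,1)\), and 100 replications). Averaging the nonzero indicator over replications gives the selection proportion for each variable.

library(glmnet)
set.seed(6250)
n <- 100; p <- 20                       # n = 100 assumed for illustration
beta <- 0.4^sqrt(1:p)
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% beta) + rnorm(n)
fit <- cv.glmnet(X, y, alpha = 1)
nonzero <- as.vector(coef(fit, s = "lambda.min"))[-1] != 0   # selection indicator, one per variable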

Bias-Variance Trade-off

Comparing Lasso and Ridge

Constrained Optimization

  • For every value of \(\lambda\), there is some \(s\) such that the two optimization problems are equivalent, giving the same coefficient estimates.

  • The \(\ell_1\) and \(\ell_2\) penalties form a constraint region within which the \(\beta_j\)s can move, i.e., a budget for how large the \(\beta_j\)s can be.

  • Larger \(s\) (smaller \(\lambda\)) means a larger region in which the \(\beta_j\)s can move freely.

Lasso \[\begin{align} \min_{\boldsymbol \beta} \,\,& \lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2 + n\lambda\lVert\boldsymbol \beta\rVert_1 \end{align}\]


\[\begin{align} \min_{\boldsymbol \beta} \,\,& \lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2\\ \text{s.t.} \,\, & \sum_{j=1}^p|\beta_j| \leq s \end{align}\]

Ridge \[\begin{align} \min_{\boldsymbol \beta} \,\,& \lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2 + n\lambda\lVert\boldsymbol \beta\rVert_2^2 \end{align}\]


\[\begin{align} \min_{\boldsymbol \beta} \,\,& \lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2\\ \text{s.t.} \,\, & \sum_{j=1}^p \beta_j^2 \leq s \end{align}\]

Geometric Representation of Optimization

What do the constraints look like geometrically?

When \(p = 2\),

  • the \(\ell_1\) constraint is \(|\beta_1| + |\beta_2| \leq s\) (diamond)
  • the \(\ell_2\) constraint is \(\beta_1^2 + \beta_2^2 \leq s\) (circle)

Source: https://stats.stackexchange.com/questions/350046/the-graphical-intuiton-of-the-lasso-in-case-p-2

Way of Shrinking (\(p = 1\) and standardized \(x\))

Lasso Soft-thresholding

\[\begin{align} \widehat{\beta}^\text{l} &= \begin{cases} b - \lambda/2 & \text{if} \quad b > \lambda/2 \\ 0 & \text{if} \quad |b| < \lambda/2 \\ b + \lambda/2 & \text{if} \quad b < -\lambda/2 \\ \end{cases} \end{align}\]

Ridge Proportional shrinkage

\[\begin{align} \widehat{\beta}^\text{r} = \dfrac{b}{1+\lambda}\end{align}\]
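A small sketch comparing the two rules over a grid of one-variable OLS estimates \(b\) (standardized \(x\); \(\lambda = 1\) chosen for illustration):

lambda <- 1
b <- seq(-3, 3, by = 0.1)
lasso_hat <- sign(b) * pmax(abs(b) - lambda / 2, 0)   # soft-thresholding
ridge_hat <- b / (1 + lambda)                         # proportional shrinkage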

Predictive Performance

Perform well (lower test MSE) when

Lasso

  • A relatively small number of \(\beta_j\)s are substantially large, and the remaining \(\beta_k\)s are small or equal to zero.

  • Reduces bias more

ISL Fig. 6.9

Ridge

  • The response is a function of many predictors, all with coefficients of roughly equal size.

  • Reduces variance more

ISL Fig. 6.8

Bayesian Interpretation

Lasso

Ridge

Let’s wait until we discuss Bayesian Regression!

Notes of Lasso

Warning

  • Even though Lasso does feature selection, do not add predictors that are known to be unassociated with the response.

  • Curse of dimensionality. The test MSE tends to increase as the dimensionality \(p\) increases, unless the additional features are truly associated with the response.

  • Do not conclude that the predictors with non-zero coefficients selected by Lasso (or by other selection methods) predict the response more effectively than the predictors left out of the model.

Other Topics

  • Elastic Net combines the two penalties: \(\lambda \left[ (1 - \alpha) \lVert \boldsymbol \beta\rVert_2^2 + \alpha \lVert\boldsymbol \beta\rVert_1 \right]\)
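This mixed penalty corresponds (up to glmnet's internal scaling) to the alpha argument of glmnet()/cv.glmnet(); a sketch reusing the prostate predictors from earlier:

enet_fit <- cv.glmnet(x = data.matrix(prostate[, 1:8]), y = prostate$lpsa,
                      alpha = 0.5, nfolds = 10)   # alpha = 1: lasso, alpha = 0: ridge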

  • General \(\ell_q\) penalty: \(\lambda \sum_{j=1}^p |\beta_j|^q\)

ESL Fig. 3.12

Other Topics

  • Algorithms
    • Shooting algorithm (Fu 1998)
    • Least angle regression (LAR) (Efron et al. 2004)
    • Coordinate descent (Friedman et al. 2010) (used in glmnet)
  • Variants
    • Adaptive Lasso
    • Group Lasso
    • Bayesian Lasso, etc.